Extracting Multiwords From Large Document Collection Based N-Gram

Author

  • M. Nirmala
Abstract

Multiword terms (MWTs) are relevant strings of words in text collections. Once extracted automatically, they can be used by an Information Retrieval system to suggest conceptually interesting refinements of its users' information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collection, and can be especially useful for refining highly generic queries. A new approach is proposed for finding collocations in text documents. A collocation is a set of words that occur together in a corpus more often than would be expected by chance. Collocations are extracted based on the frequency of the joint occurrence of the words as well as the frequency of the individual occurrences of each word in the whole text. Intuitively, when a set of words is extracted as a collocation, the joint occurrence of the words must be high in comparison to that of the constituent individual words.

Keywords: Multiword terms (MWTs), Information, Collocations, Extraction, Text Document.
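The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: compare how often two words occur together with how often each occurs on its own. It assumes adjacent word pairs (bigrams) as candidates and uses the Dice coefficient as the comparison, which is a stand-in and not necessarily the measure used in the paper. The function name `extract_collocations` and the parameters `min_count` and `threshold` are hypothetical.

```python
from collections import Counter


def extract_collocations(tokens, min_count=2, threshold=0.5):
    """Return (bigram, score) pairs whose Dice score meets the threshold.

    Assumption: the Dice coefficient 2*f(x,y) / (f(x) + f(y)) stands in for
    the paper's unspecified comparison of joint and individual frequencies.
    """
    unigrams = Counter(tokens)                  # individual word frequencies
    bigrams = Counter(zip(tokens, tokens[1:]))  # joint frequencies of adjacent words
    results = []
    for (w1, w2), joint in bigrams.items():
        if joint < min_count:
            continue  # skip pairs seen too rarely to judge
        score = 2 * joint / (unigrams[w1] + unigrams[w2])
        if score >= threshold:
            results.append(((w1, w2), score))
    # Highest-scoring candidates first
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Toy corpus, whitespace-tokenized (an illustrative input, not data from the paper)
text = (
    "information retrieval systems rank documents . "
    "information retrieval research uses text collections . "
    "text collections support information retrieval ."
)
for bigram, score in extract_collocations(text.split()):
    print(" ".join(bigram), round(score, 2))
```

A pair scores close to 1 when its words almost always appear together and close to 0 when each word mostly appears elsewhere, which matches the intuition that the joint occurrence must be high relative to the individual occurrences.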

Similar resources

Multiword Frequency Analysis Based on MEDLINE N-gram Set

Multiwords are vital to better precision and recall in NLP applications. The Lexical Systems Group (LSG) developed an effective approach to add multiwords to the SPECIALIST Lexicon from the MEDLINE n-gram set. This paper describes a frequency analysis on LexMultiwords (LMWs) and acronym expansions based on the word count (WC) in MEDLINE. Results show most LMWs locate in the low WC range with be...

Generating Multiwords from MEDLINE in the SPECIALIST Lexicon

Multiwords are vital to better NLP systems for more effective and efficient parsers, refining information retrieval searches, enhancing precision and recall in NLP applications, etc. The Lexical Systems Group (LSG) enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource. This paper describes a new systematic approach to lexical multiword acquisition from MEDL...

Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon

Multiwords are vital to better Natural Language Processing (NLP) systems for more effective and efficient parsers, refining information retrieval searches, enhancing precision and recall in Medical Language Processing (MLP) applications, etc. The Lexical Systems Group has enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource for such applications. This pape...

Generating the MEDLINE N-Gram Set

The MEDLINE n-gram set is a very useful resource in Natural Language Processing (NLP) and Medical Language Processing (MLP). Currently, there is no MEDLINE n-gram set available in the public domain. Due to the large scale of data, it is a challenge to generate MEDLINE n-grams to fit into a research schedule with limited computer resources. The Lexical System Group (LSG) developed an algorithm t...

Information Extraction from Web-Scale N-Gram Data

Search engines are increasingly relying on structured data to provide direct answers to certain types of queries. However, extracting such structured data from text is challenging, especially due to the scarcity of explicitly expressed knowledge. Even when relying on large document collections, pattern-based information extraction approaches typically expose only insufficient amounts of informa...

Publication date: 2013